To build the hierarchical clustering dendrograms, the pairwise distances were first computed using the Euclidean distance.
x = c(3,4,2.1,4,7,3,6.1,7,4,3,6.2,7,5,3.5,2.5,3.5,5.5,6,0.5,0.8)
y = c(6.1,2,5,6,3,5,4,2,1.5,2,2,3,5,4.5,6,5.5,4.5,1,1.5,1.2)
data = data.frame(x,y)
dist = dist(data)
dist
## 1 2 3 4 5 6 7
## 2 4.2201896
## 3 1.4212670 3.5510562
## 4 1.0049876 4.0000000 2.1470911
## 5 5.0606324 3.1622777 5.2924474 4.2426407
## 6 1.1000000 3.1622777 0.9000000 1.4142136 4.4721360
## 7 3.7443290 2.9000000 4.1231056 2.9000000 1.3453624 3.2572995
## 8 5.7280014 3.0000000 5.7454330 5.0000000 1.0000000 5.0000000 2.1931712
## 9 4.7074409 0.5000000 3.9824616 4.5000000 3.3541020 3.6400549 3.2649655
## 10 4.1000000 1.0000000 3.1320920 4.1231056 4.1231056 3.0000000 3.6891733
## 11 5.2009614 2.2000000 5.0803543 4.5650849 1.2806248 4.3863424 2.0024984
## 12 5.0606324 3.1622777 5.2924474 4.2426407 0.0000000 4.4721360 1.3453624
## 13 2.2825424 3.1622777 2.9000000 1.4142136 2.8284271 2.0000000 1.4866069
## 14 1.6763055 2.5495098 1.4866069 1.5811388 3.8078866 0.7071068 2.6476405
## 15 0.5099020 4.2720019 1.0770330 1.5000000 5.4083269 1.1180340 4.1182521
## 16 0.7810250 3.5355339 1.4866069 0.7071068 4.3011626 0.7071068 3.0016662
## 17 2.9681644 2.9154759 3.4365681 2.1213203 2.1213203 2.5495098 0.7810250
## 18 5.9169249 2.2360680 5.5865911 5.3851648 2.2360680 5.0000000 3.0016662
## 19 5.2354560 3.5355339 3.8483763 5.7008771 6.6708320 4.3011626 6.1326992
## 20 5.3712196 3.2984845 4.0162171 5.7688820 6.4560050 4.3908997 5.9941638
## 8 9 10 11 12 13 14
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9 3.0413813
## 10 4.0000000 1.1180340
## 11 0.8000000 2.2561028 3.2000000
## 12 1.0000000 3.3541020 4.1231056 1.2806248
## 13 3.6055513 3.6400549 3.6055513 3.2310989 2.8284271
## 14 4.3011626 3.0413813 2.5495098 3.6796739 3.8078866 1.5811388
## 15 6.0207973 4.7434165 4.0311289 5.4488531 5.4083269 2.6925824 1.8027756
## 16 4.9497475 4.0311289 3.5355339 4.4204072 4.3011626 1.5811388 1.0000000
## 17 2.9154759 3.3541020 3.5355339 2.5961510 2.1213203 0.7071068 2.0000000
## 18 1.4142136 2.0615528 3.1622777 1.0198039 2.2360680 4.1231056 4.3011626
## 19 6.5192024 3.5000000 2.5495098 5.7218878 6.6708320 5.7008771 4.2426407
## 20 6.2513998 3.2140317 2.3409400 5.4589376 6.4560050 5.6639209 4.2638011
## 15 16 17 18 19
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10
## 11
## 12
## 13
## 14
## 15
## 16 1.1180340
## 17 3.3541020 2.2360680
## 18 6.1032778 5.1478151 3.5355339
## 19 4.9244289 5.0000000 5.8309519 5.5226805
## 20 5.0921508 5.0774009 5.7428216 5.2038447 0.4242641
knitr::include_graphics("hsimple.jpg")
As we can see, the dendrogram itself provides the order of the mergers: reading it from the bottom up shows the sequence in which the elements were joined. For single linkage, each merger uses the smallest distance between any two points of the clusters being joined.
This is the one generated by R for reference.
clust = hclust(dist, method="single")
plot(clust)
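Beyond the plot, the merge order can be read directly off the fitted object. This is a self-contained sketch that rebuilds the same data; `$merge` and `$height` are standard components of the object `hclust()` returns.

```r
# Rebuild the same data so this snippet runs on its own
x = c(3,4,2.1,4,7,3,6.1,7,4,3,6.2,7,5,3.5,2.5,3.5,5.5,6,0.5,0.8)
y = c(6.1,2,5,6,3,5,4,2,1.5,2,2,3,5,4.5,6,5.5,4.5,1,1.5,1.2)
clust = hclust(dist(data.frame(x, y)), method = "single")

# $merge lists one merger per row (negative entries are original points,
# positive entries refer to the cluster formed at that earlier row);
# $height is the single-linkage distance at which each merger happened
head(cbind(clust$merge, height = clust$height))
```

The first merger happens at height 0, since points 5 and 12 are identical, and for single linkage the heights are always non-decreasing.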
knitr::include_graphics("hcomplete.jpg")
As we can see, this dendrogram also provides the order of the mergers, since each element's position shows when it was joined. For complete linkage, the distance between two clusters is the largest distance between any pair of their elements, so a merger happens only when that maximum distance is the smallest one available.
This is the one generated by R for reference.
clust2 = hclust(dist, method="complete")
plot(clust2)
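If flat cluster labels are wanted from the complete-linkage tree, `cutree()` slices the dendrogram at a chosen number of groups. A minimal self-contained sketch (the choice of `k = 4` is an arbitrary illustration, not a value from the analysis):

```r
# Same data as above, rebuilt so the snippet stands alone
x = c(3,4,2.1,4,7,3,6.1,7,4,3,6.2,7,5,3.5,2.5,3.5,5.5,6,0.5,0.8)
y = c(6.1,2,5,6,3,5,4,2,1.5,2,2,3,5,4.5,6,5.5,4.5,1,1.5,1.2)
clust2 = hclust(dist(data.frame(x, y)), method = "complete")

# Cut the tree into 4 groups (k is a hypothetical choice for illustration)
groups = cutree(clust2, k = 4)
table(groups)  # size of each resulting cluster
```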
set.seed(20)
xOrdered = c(3, 2.1, 3, 5, 4, 4, 7, 6.1, 7, 4, 3, 6.2, 7, 3.5, 2.5, 3.5, 5.5, 6, 0.5, 0.8)
yOrdered = c(6.5, 1, 5, 5, 2, 6, 3, 4, 2, 1.5, 2, 2, 3, 4.5, 6, 5.5, 4.5, 1, 1.5, 1.2)
dataOrdered = data.frame(x=xOrdered, y=yOrdered)
kmeansResult <- kmeans(data, dataOrdered[1:4,], nstart = 20)
kmeansResult$cluster <- as.factor(kmeansResult$cluster)
library(ggplot2)
ggplot(data, aes(x, y, color = kmeansResult$cluster)) + geom_point()
With this code, the four points selected as initial centroids are placed first in the reordered data frame so they can be passed to kmeans as explicit starting centers via dataOrdered[1:4,]. Note that nstart is not related to the 20 points in the data: it controls how many random initializations kmeans tries (keeping the best run), and it is ignored when explicit starting centers are supplied, as they are here. kmeans then iterates: each point is assigned to its nearest centroid, each centroid is recomputed as the mean of its assigned points, and the process repeats until the centroids no longer change.
In this code, the cluster labels are converted to factors so that each cluster gets a distinct color. If they are left as numeric, ggplot uses a continuous color gradient instead of discrete colors.
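The difference between explicit starting centers and random restarts can be made concrete. This is a self-contained sketch; the rows chosen as fixed centers (1, 5, 13, 19) are an arbitrary illustration, not the points used in the analysis above.

```r
x = c(3,4,2.1,4,7,3,6.1,7,4,3,6.2,7,5,3.5,2.5,3.5,5.5,6,0.5,0.8)
y = c(6.1,2,5,6,3,5,4,2,1.5,2,2,3,5,4.5,6,5.5,4.5,1,1.5,1.2)
data = data.frame(x, y)

# Explicit starting centers: a single deterministic run (nstart is ignored)
fixed = kmeans(data, centers = data[c(1, 5, 13, 19), ])

# Random starts: nstart = 20 repeats the algorithm from 20 random
# initializations and keeps the run with the lowest total
# within-cluster sum of squares
set.seed(20)
random = kmeans(data, centers = 4, nstart = 20)

c(fixed = fixed$tot.withinss, random = random$tot.withinss)
```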
messy = read.csv("DM2017.txt", sep = " ", header = FALSE, stringsAsFactors = FALSE)
rowNames = messy[,c(1)][-c(1)]
columnNames = messy[c(1),]
messyStripped = messy[-c(1), -c(1)]
distances = dist(messyStripped)
clusters = hclust(distances, method="complete")
plot(clusters)
orderedNumbers <- clusters$order
finalData = data.frame(rowNames, ids = seq.int(640))
finalData$ids <- factor(finalData$ids,levels=orderedNumbers)
finalData <- finalData[order(finalData$ids),]
write.table(finalData$rowNames, file = "rows.csv", row.names=FALSE, sep=" ", quote = FALSE)
First, the data is read from the file. The row and column labels are removed because hierarchical clustering cannot work with them: they are strings, not numeric values. The distances are computed for the remaining data frame, and hierarchical clustering is run on them with the complete method. I tried all of the other available methods and none looked as decent as complete. The dendrogram is plotted for informative purposes only, as it is not directly useful for this analysis. Then, because I didn't know how to reorder the rows directly, I added an ID column and used it to order the row information. To make these rows easier to use, I wrote them to a file containing all of the row IDs in the order the hierarchical clustering determined to be best. The end result is this:
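As an aside, the ID/factor detour can be avoided: `hclust()` already exposes the dendrogram's leaf order in its `$order` component, which can index the rows directly. A minimal sketch on a synthetic stand-in for the real matrix (the real file is not reproduced here):

```r
# Synthetic stand-in for the stripped data: a small random 0/1 matrix
set.seed(1)
mat = matrix(rbinom(60, 1, 0.5), nrow = 10)

clusters = hclust(dist(mat), method = "complete")

# $order is the permutation of row indices as they appear left-to-right
# in the dendrogram, so it can reorder the rows in one step
reordered = mat[clusters$order, ]
dim(reordered)  # same matrix, rows rearranged
```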
knitr::include_graphics("hclust.png")
Even though it's obviously not as good as the original, I can say with confidence that the image is a lion in the middle of some kind of dry pasture. There's a tree with green leaves on the right and a tree with dry leaves on the left.
First, he had to do some cleaning, as we did in the first homeworks. Since the data came from various sources, he had to adapt it in different ways, such as adding an indicator for the type of taxi (yellow taxi, green taxi, or Uber).
The first two map graphs are made of points plotted over geographic coordinates, so that the points themselves, in stronger and weaker colors, form the map. The graphs dealing with pickups in different sectors of New York are ordinary line graphs that could easily be made in R; they analyze what is happening with Uber by filtering specific days and plotting them to see the trend.
However, the graphs that deal with the airports are a very cool variation on box plots. They start with an analysis of the time it takes to reach the airport and then expand to showing the information with quantiles.
Then, for the Die Hard analysis, an ordinary histogram shows the most frequent travel times based on data filtered by time and location. For the weather analysis, nothing is particularly noteworthy other than the author converting the measurements to percentages so that regular taxis and Uber are easier to compare.
The analysis of late-night pickups could perhaps be interpreted as a kind of clustering, although here the author has an explicit way of sectorizing the pickups, so it is not parameter-free clustering like we do in our assignments. Still, it groups scattered pickup locations into clusters so that we can see where there are more pickups and where there are fewer.
The B&T analysis takes a similarly cluster-like approach, while a scatter plot over a map gives more detail on Murray Hill. The Williamsburg analysis also has a regular line graph along with an animated scatterplot.
The investment-bankers analysis is likewise a regular histogram of people's arrival times at a defined location. In the parting thoughts, there are ordinary line plots for the credit-card analysis, with the Cash vs Credit Card by Total Fare Amount graph having several lines, as we sometimes do in ggplot (though these look nicer).
Regarding the updates, there are still comparison images using ordinary line graphs with several lines overlaid for easy comparison, used to analyze the surge of rides taken through Uber.
The report can be found here: https://docs.google.com/document/d/1RTAASwbB9Xs_6IKfEtNmDErWQKxld1NPWBTjxurEJW4/edit#heading=h.k2uf3dxko21l
The proposal can be found here, and of course it can be shared with Taxify: https://docs.google.com/document/d/1VWPOJ9_ZofMJ3ijVsO-ErZNnQlA07IjtJLKXuGjOFn4/edit?usp=sharing